Deep Fake Challenge

Michael Zeng, Richard Ryu, Adam Sohn

Background

  • The goal of the challenge is to spur researchers around the world to build innovative new technologies that can help detect deepfakes and manipulated media.
  • The challenge is backed by AWS, Facebook, Microsoft, the Partnership on AI’s Media Integrity Steering Committee, and academics, who have come together to build the Deepfake Detection Challenge (DFDC).

EDA

The exploratory analysis focuses on the sample videos:

Train/Test   # Total   # Fake   # Real   # Original
Train            400      323       77          209
Test             400        ?        ?            ?
In [10]:
import pandas as pd

# The metadata maps each video to its label and (for fakes) its source video
metadata = pd.read_json('data/train_sample_videos/metadata.json').T
metadata.head()
Out[10]:
label original split
aagfhgtpmv.mp4 FAKE vudstovrck.mp4 train
aapnvogymq.mp4 FAKE jdubbvfswz.mp4 train
abarnvbtwb.mp4 REAL None train
abofeumbvv.mp4 FAKE atvmxvwyns.mp4 train
abqwwspghj.mp4 FAKE qzimuostzz.mp4 train
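The counts in the table above can be reproduced directly from the metadata frame. A minimal sketch, using a toy stand-in frame in place of the real metadata.json (the column names match the output above):

```python
import pandas as pd

# Toy stand-in for the real metadata.json loaded above
metadata = pd.DataFrame(
    {
        "label": ["FAKE", "FAKE", "REAL", "FAKE"],
        "original": ["vudstovrck.mp4", "jdubbvfswz.mp4", None, "vudstovrck.mp4"],
    },
    index=["aagfhgtpmv.mp4", "aapnvogymq.mp4", "abarnvbtwb.mp4", "abofeumbvv.mp4"],
)

n_fake = (metadata["label"] == "FAKE").sum()
n_real = (metadata["label"] == "REAL").sum()
n_original = metadata["original"].nunique()  # distinct source videos; None is ignored
print(n_fake, n_real, n_original)  # 3 1 2
```

On the real sample set, the same three expressions yield the 323 / 77 / 209 figures in the table.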

Here is an example of a real video, together with 2 of the fakes that are based on it:

In [36]:
# this is a real one
play_video('ellavthztb.mp4')
Out[36]:
In [37]:
metadata[metadata.original == 'ellavthztb.mp4']
Out[37]:
label original split
bnjcdrfuov.mp4 FAKE ellavthztb.mp4 train
dbzpcjntve.mp4 FAKE ellavthztb.mp4 train
In [49]:
# and this is one of the fakes
play_video('dbzpcjntve.mp4')
Out[49]:

Observations so far:

  • The fakes can be quite subtle, but all manipulations are confined to the facial areas. Therefore, recognizing faces and cropping out the facial regions will be the starting point of the project.
  • The data set is imbalanced, at least in the sample set of 400 videos, so upsampling may be required to rebalance it.
  • The number of originals is much larger than the number of REALs in the set provided, so "memorizing" the REALs will not generalize.
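The rebalancing step could be as simple as upsampling the minority class with replacement. A minimal sketch, using a toy label frame with the same 323/77 split as the sample set:

```python
import pandas as pd

# Toy imbalanced label frame: 323 FAKE vs 77 REAL, as in the sample set
labels = pd.DataFrame({"label": ["FAKE"] * 323 + ["REAL"] * 77})

fake = labels[labels["label"] == "FAKE"]
real = labels[labels["label"] == "REAL"]

# Upsample the minority class with replacement to match the majority,
# then shuffle the combined frame
real_up = real.sample(n=len(fake), replace=True, random_state=0)
balanced = pd.concat([fake, real_up]).sample(frac=1, random_state=0)

print(balanced["label"].value_counts().to_dict())  # {'FAKE': 323, 'REAL': 323}
```

Sampling with replacement duplicates REAL rows; an alternative is to downsample the FAKEs, at the cost of discarding data.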

Facial Recognition

  • We are using MTCNN from facenet-pytorch (https://github.com/timesler/facenet-pytorch) as the primary tool for recognizing faces in the videos.
  • Every video is 10 seconds long and contains 300 frames. We sample 30 frames from every video and stack them together in a list.
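The frame-sampling step above amounts to picking 30 evenly spaced indices out of the 300 frames. A minimal sketch (the helper name is ours; in the actual pipeline the frames at these indices are then decoded and passed to MTCNN):

```python
import numpy as np

def sample_frame_indices(n_frames: int = 300, n_samples: int = 30) -> np.ndarray:
    """Evenly spaced frame indices, e.g. roughly every 10th frame of a 300-frame clip."""
    return np.linspace(0, n_frames - 1, n_samples).round().astype(int)

idx = sample_frame_indices()
print(len(idx), idx[0], idx[-1])  # 30 0 299
```

Evenly spaced sampling covers the whole clip, so a manipulation visible only in part of the video is still likely to be captured.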

Here is a demo (note that this video is a fake):

In [58]:
# Play back the tracked frames in a loop; interrupt the kernel to stop
d = display.display(frames_tracked[0], display_id=True)
i = 1
try:
    while True:
        d.update(frames_tracked[i % len(frames_tracked)])
        i += 1
except KeyboardInterrupt:
    pass

The package also allows us to detect key facial points:

In [60]:
# Visualize detected boxes and key facial landmarks on a 3x3 grid of frames
fig, ax = plt.subplots(3, 3, figsize=(18, 12))
for i in range(9):
    ax[i // 3, i % 3].imshow(view_frames[i])
    ax[i // 3, i % 3].axis('off')
    for box, landmark in zip(view_boxes[i], view_landmarks[i]):
        ax[i // 3, i % 3].scatter(*np.meshgrid(box[[0, 2]], box[[1, 3]]), s=8)
        ax[i // 3, i % 3].scatter(landmark[:, 0], landmark[:, 1], s=6)

Progress so far

We adapted and wrote our own pipeline for facial recognition. It runs at about 15 fps on a MacBook Pro, which is not bad at all. However, here are some challenges we are facing:

  • Frames/videos with no face detected: this usually happens when the video is too dark.
In [75]:
play_video('djvutyvaio.mp4')
Out[75]:
  • Frames/videos with "too many" faces detected: faces on a painting or T-shirt get detected.
In [79]:
# Play back the tracked frames in a loop; interrupt the kernel to stop
d = display.display(frames_tracked[0], display_id=True)
i = 1
try:
    while True:
        d.update(frames_tracked[i % len(frames_tracked)])
        i += 1
except KeyboardInterrupt:
    pass
  • Frames/videos with "not-a-face" detections: we have seen hands detected as faces.
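The last two failure modes could be mitigated by filtering detections on confidence and box size (MTCNN returns a probability alongside each bounding box). A minimal sketch with a hypothetical `filter_faces` helper and made-up detections; the thresholds are assumptions to be tuned:

```python
import numpy as np

def filter_faces(boxes, probs, min_prob=0.95, min_size=40):
    """Drop low-confidence or tiny detections (paintings, T-shirts, hands)."""
    keep = []
    for box, p in zip(boxes, probs):
        w, h = box[2] - box[0], box[3] - box[1]
        if p >= min_prob and min(w, h) >= min_size:
            keep.append(box)
    return keep

# Hypothetical detections: a real face, a tiny "face" on a T-shirt,
# and a low-confidence hand
boxes = [np.array([100, 80, 260, 300]),
         np.array([50, 400, 80, 430]),
         np.array([0, 0, 120, 150])]
probs = [0.999, 0.97, 0.60]
print(len(filter_faces(boxes, probs)))  # 1
```

Tightening `min_prob` and `min_size` trades recall for precision, which is exactly the balance the Week 10 tuning is meant to find.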

Next steps:

  • Week 10: spend one more week tuning MTCNN parameters to find a good balance between precision and recall
  • Week 11: explore ways to encode faces so that each video can be represented by a multi-dimensional tensor and fed into a CNN or RNN as a baseline model
  • Week 12-14: build a data pipeline that takes in the full training set and train on one or more V100s
  • Week 15: submit and present
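The Week 11 encoding step could be sketched as follows: stack the 30 sampled face crops per video into one tensor in the (frames, channels, height, width) layout a CNN or RNN expects. The crop size of 160x160 is an assumption (it is the default face size MTCNN crops to):

```python
import numpy as np

# Hypothetical input: 30 sampled face crops per video, each 160x160 RGB
n_frames, size = 30, 160
faces = [np.zeros((size, size, 3), dtype=np.float32) for _ in range(n_frames)]

# Stack to (frames, channels, height, width) for a CNN/RNN
video_tensor = np.stack(faces).transpose(0, 3, 1, 2)
print(video_tensor.shape)  # (30, 3, 160, 160)
```

A CNN would treat the 30 frames as a batch of images, while an RNN would consume them as a sequence; either way, each video becomes a single fixed-shape sample.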